Practical Implementations of Principal Component Analysis

Nick Belgau, Oscar Hernandez Mata

2024-08-01

Introduction

About

  • PCA is a technique for extracting insights from complex datasets across various industries and data science applications.

  • The primary objective of PCA is to reduce the number of dimensions in the dataset while retaining most of the original information (Rahayu et al. 2017).

  • PCA utilizes linear algebra to transform the original dataset into a lower-dimensional vector space of uncorrelated variables known as principal components.

  • These are linear combinations of the original variables, and represent the information in a new coordinate system with axes aligned to the directions of maximum variability (Joshi and Patil 2020).

Functionality

  • PCA can reveal unexpected relationships between the original variables that would otherwise be challenging to identify.

  • By reducing the dimensionality, PCA simplifies data structure, making it easier to interpret and analyze.

  • PCA identifies trends, patterns, and outliers in the reduced-dimension dataset (Joshi and Patil 2020).

Advantages & Limitations

Advantages

  • Helps mitigate the curse of dimensionality: data sparsity, multicollinearity, and overfitting (Altman and Krzywinski 2018).

  • Multicollinearity is mitigated because the principal components are uncorrelated with each other, which improves the stability and performance of predictive models (a small demonstration follows this list).

  • With dimensionality reduction, models become simpler which improves generalization and reduces overfitting (Bharadiya 2023).

  • In machine learning pipelines, PCA is often employed because reducing the number of dimensions decreases computational complexity, lowers memory requirements, and enhances algorithm efficiency.

  • PCA identifies the most influential variables. It effectively filters out noise and irrelevant variations (Bharadiya 2023).
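As a hedged illustration of the decorrelation point above (synthetic data, not part of the original analysis): the raw features below are strongly correlated, while the PCA scores have an approximately identity correlation matrix.

Code
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=500)   # deliberately collinear with x1
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Raw features: large off-diagonal correlations
print(np.corrcoef(X, rowvar=False).round(2))

# PCA scores: correlation matrix is ~identity (components are uncorrelated)
scores = PCA().fit_transform((X - X.mean(axis=0)) / X.std(axis=0))
print(np.corrcoef(scores, rowvar=False).round(2))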

Limitations

  • PCA assumes linearity: if nonlinear relationships exist, it may fail to capture the underlying structure in the data. Modern variants such as Kernel PCA seek to address this (a brief sketch follows this list).

  • PCA is also sensitive to outliers, which can degrade the quality of the dimensionality reduction. Improvements such as Robust PCA address this by decomposing the data into a low-rank matrix and a sparse matrix to separate signal from noise (Bharadiya 2023).
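As a rough, hedged sketch of the Kernel PCA idea (sklearn's KernelPCA on synthetic two-ring data, not part of the original analysis):

Code
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: a structure linear PCA cannot separate along any axis
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA: components remain a rotation of the original plane
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel maps the data into a feature space where
# the nonlinear (ring) structure becomes approximately linearly separable
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)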

PCA applications

  • Image compression and classification. However, in real-world applications, PCA has largely been supplanted by purpose-built compression algorithms such as JPEG.

  • In the realm of image classification, while PCA can enhance the performance of traditional machine learning algorithms, modern techniques predominantly favor Convolutional Neural Networks (CNNs).

  • Despite these advancements, PCA remains crucial in data science for its ability to simplify complex datasets. However, it is essential to understand the contexts in which PCA is not the best choice and to consider alternative methods where they are more suitable (Ali, Wassif, and Bayomi 2024).

The Math behind PCA

SVD is faster and more accurate than eigen-decomposition

Although PCA is traditionally taught through eigen-decomposition of the covariance matrix, Singular Value Decomposition (SVD) is what standard implementations use in practice.

  • Numerical stability: No need to form the covariance matrix (\(X^TX\)), which squares the condition number and can amplify rounding errors; robust against ill-conditioned matrices.
  • Efficient with large datasets: Directly decomposes the data without needing to compute and square the covariance matrix (Johnson and Wichern 2023).

Both sklearn.decomposition.PCA (Python) and stats::prcomp() (R) compute the decomposition via SVD.

SVD decomposition
Ensure features are continuous and standardized (\(\mu\) = 0, \(\sigma\) = 1):
\[ X = U \Sigma V^T \]
Each column of \(V\) represents a principal component (PC); the PCs are orthogonal to each other and form the new axes of maximum variance.

Calculate explained variance by PC
The diagonal singular value matrix \(\Sigma\) corresponds to the strength of each PC:
\[ \text{variance explained}_i = \frac{\sigma_i^2}{\sum_j \sigma_j^2} \]

Dimensionality reduction (a minimal sketch of these steps follows):
- Select PCs: based on a cumulative explained variance target (e.g., 95%).
- Truncate \(V\): keep the top PCs to reduce dimensions and transform \(X\) into the new feature space. \[ X_{\text{transformed}} = X V_{\text{selected}} \]
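To make these steps concrete, here is a minimal NumPy sketch on synthetic data (illustrative only): standardize \(X\), decompose with SVD, compute explained variance, and truncate \(V\). It mirrors the computation performed by SVD-based implementations such as sklearn.decomposition.PCA and stats::prcomp().

Code
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # 200 observations, 5 features (synthetic)

# 1. Standardize: mean 0, standard deviation 1 for each feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. SVD: X = U * Sigma * V^T ; columns of V are the principal components
U, sigma, Vt = np.linalg.svd(X_std, full_matrices=False)

# 3. Explained variance per component: sigma_i^2 / sum(sigma_j^2)
variance_explained = sigma**2 / np.sum(sigma**2)
cumulative = np.cumsum(variance_explained)

# 4. Keep the smallest number of PCs reaching the 95% target, then project
k = int(np.searchsorted(cumulative, 0.95)) + 1
V_selected = Vt.T[:, :k]
X_transformed = X_std @ V_selected

print(variance_explained.round(3), "-> keeping", k, "components")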

The effectiveness of PCA relies on satisfying the following assumptions.

  1. Linearity
    PCA assumes the principal components are linear combinations of the original variables. Nonlinear relationships produce low covariance values, which can cause those relationships to be underrepresented in the components.

  2. Continuous data
    PCA begins by standardizing the data, so the features should come from a continuous distribution. Because the scale of measurement conveys information about the variance, categorical variables should be handled separately.

  3. Data standardization

  • Scaling standardizes the variance of each variable to ensure equal contributions.
  • Mean-centering has a similar impact: ensuring that the principal components capture the true direction of maximum variance.
  • Outliers can also distort the PCs, so they should be identified and handled appropriately.

Application 1 - Demographic Data

Dataset Description

This application was inspired by a University of West Florida (UWF) paper on Alzheimer’s disease mortality (Tejada-Vera 2013; Amin, Yacko, and Guttmann 2018).

The dataset is derived from US census data and contains demographic, health, and environmental metrics for counties in the United States. It was filtered to counties in the Deep South.

Columns: Obesity Age Adj, Smoking Rate, Diabetes, Heart Disease, Cancer, Food Index, Poverty Percent, Physical Inactivity, Mercury TPY, Lead TPY, Atrazine High KG

Check Assumptions: Continuous Variables

  • The variables appear to be continuous because the data types are “numeric” with high cardinality.
  • There are no nulls.
  • It is clear that scaling and mean-centering will be needed.
skim(data)
Data summary
Name data
Number of rows 1143
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
obesity_age_adj 0 1 31.65 3.76 19.00 29.16 31.26 33.95 46.92 ▁▆▇▂▁
Smoking_Rate 0 1 25.06 3.56 10.72 22.94 25.62 27.68 32.86 ▁▁▃▇▂
Diabetes 0 1 10.64 1.62 6.48 9.32 10.56 11.69 17.92 ▂▇▆▁▁
Heart_Disease 0 1 126.68 38.46 41.20 99.90 120.10 146.85 279.20 ▂▇▃▁▁
Cancer 0 1 187.78 26.55 75.33 170.25 188.26 204.40 370.64 ▁▇▆▁▁
Mercury_TPY 0 1 0.02 0.07 0.00 0.00 0.00 0.00 0.94 ▇▁▁▁▁
Lead_TPY 0 1 0.14 0.30 0.00 0.01 0.04 0.14 2.80 ▇▁▁▁▁
Food_index 0 1 6.60 1.30 0.00 5.90 6.70 7.40 10.00 ▁▁▃▇▁
Poverty_Percent 0 1 19.21 6.67 0.00 14.90 18.80 23.15 47.70 ▁▇▇▁▁
Atrazine_High_KG 0 1 4531.16 24239.04 0.00 94.50 632.80 3480.10 768660.60 ▇▁▁▁▁
SUNLIGHT 0 1 17689.04 1037.82 15389.96 16897.70 17723.25 18285.74 21671.87 ▃▇▆▂▁

Check Assumptions: Linearity Analysis

Visually checking scatter plots for every pair of variables is not realistic in real-world applications, and correlation plots alone are insufficient, so the Harvey-Collier test for linear functional form was applied to each variable pair.

Code
library(lmtest)   # provides harvtest() for the Harvey-Collier test

harvey_collier_test <- function(data, x, y) {
  formula <- as.formula(paste(y, "~", x))
  model <- lm(formula, data = data)
  test <- harvtest(model)
  p_value <- test$p.value
  return(signif(p_value, digits = 2))
}

variables <- names(data)
n <- length(variables)
p_matrix <- matrix(NA, n, n, dimnames = list(variables, variables))

for (i in 1:n) {
  for (j in 1:n) {
    if (i != j) {  # Avoid testing a variable against itself
      p_matrix[i, j] <- harvey_collier_test(data, variables[i], variables[j])
    }
  }
}

library(ggplot2)
library(reshape2)   # provides melt()

p_matrix_long <- melt(p_matrix)
names(p_matrix_long) <- c("Var1", "Var2", "p_value")

p_matrix_long$Var1 <- factor(p_matrix_long$Var1, levels = rev(unique(p_matrix_long$Var1)))

alpha = 0.0001

gradient_fill <- scale_fill_gradientn(
  colors = c("#215B9D", "#F0F0F0", "#F0F0F0"),
  values = scales::rescale(c(0, alpha, 1)),
  na.value = "#F0F0F0",  # Also set missing values to light grey
  guide = "colourbar"
)

ggplot(p_matrix_long, aes(Var1, Var2, fill= p_value)) + 
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2e", p_value)), color = "black", size = 2) + 
  gradient_fill +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none") + 
  labs(title = "Heatmap of P-Values") 
  • Note: the asymmetry arises because X ~ Y and Y ~ X produce different recursive residuals.

Check Assumptions: Linearity Analysis (continued)

  • The residual plot for a single pair that was flagged as nonlinear reveals why the Harvey-Collier Test declared nonlinearity.
  • Transformations may risk altering other variable relationships, so no transformations were applied, acknowledging some information loss in PCA.
Code
data_residual <- as.data.frame(data)

data_residual$residuals <- residuals(lm(Diabetes ~ obesity_age_adj, data = data_residual))

ggplot(data_residual, aes(x = obesity_age_adj, y = residuals)) +
  geom_point() + 
  geom_smooth(method = "loess", se = FALSE, color = "blue") +  # LOWESS curve
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals of Diabetes vs. obesity_age_adj",
       x = "obesity_age_adj", y = "Residuals") +
  theme_minimal()

Check Assumptions: Outliers Analysis

Note: mean-centering and scaling are handled within the PCA implementation.

  • Outliers can distort PCA results by disproportionately increasing variance, shifting the direction of principal components, and inflating eigenvalues.
  • Although PCA does not require normality, a roughly normal distribution minimizes the impact from outliers.
  • Non-normal data may undergo transformations such as the Box-Cox transformation to approximate normality.
  • Assessing skewness and kurtosis offers practical insights into distribution characteristics.
                   Kurtosis   Skewness
Atrazine_High_KG 866.377802 27.6792372
Mercury_TPY       72.724213  7.4402444
Lead_TPY          34.286296  5.0314443
Cancer             5.489565  0.3562768
Food_index         4.878495 -0.8643685
Heart_Disease      3.867908  0.8405355
Poverty_Percent    3.795263  0.4674652
obesity_age_adj    3.680773  0.2879323
Smoking_Rate       3.531763 -0.7543274
Diabetes           3.344971  0.5427094
SUNLIGHT           2.974585  0.3380277

Check Assumptions: Outliers Analysis (continued)

Estimated Box-Cox transformation parameters (\(\lambda\)) for the most heavily skewed variables:

                 Lambda
Lead_TPY            0.1
Mercury_TPY        -0.1
Atrazine_High_KG    0.1
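As a hedged illustration of how such \(\lambda\) values can be estimated by maximum likelihood (a synthetic stand-in variable via scipy, not the original R workflow; Box-Cox requires strictly positive values, so a small offset is assumed for the zero-valued pollutant measurements):

Code
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lead_tpy = rng.lognormal(mean=-2.5, sigma=1.5, size=1000)   # synthetic, heavy right skew

offset = 1e-3                                    # assumption: shift values away from zero
transformed, lam = stats.boxcox(lead_tpy + offset)

print(f"estimated lambda: {lam:.2f}")
print(f"skewness before: {stats.skew(lead_tpy):.2f}, after: {stats.skew(transformed):.2f}")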

Multicollinearity Analysis

Correlation
- While not a complete diagnosis, identifying highly correlated variables gives a rough sense of how effective PCA will be.
- Recall that PCA eliminates multicollinearity because the PCs are orthogonal (Altman and Krzywinski 2018).

Variance Inflation Factor (VIF)
- Consider a scenario where ‘Heart_Disease’ is chosen as the dependent variable to model.
- Multicollinearity between predictors such as ‘obesity_age_adj’ and ‘Diabetes’ shows up as elevated VIF values.

Code
library(car)
data_df <- data.frame(data_transform)
model <- lm(Heart_Disease ~ ., data=data_df)

data.frame(VIF=vif(model))
                      VIF
obesity_age_adj  4.321218
Smoking_Rate     1.812287
Diabetes         4.009033
Cancer           1.640111
Mercury_TPY      2.098779
Lead_TPY         2.143279
Food_index       2.159996
Poverty_Percent  2.228750
Atrazine_High_KG 1.085929
SUNLIGHT         1.227205

Principal Component Analysis

  • Mean-centering and scaling were performed at the time of PCA execution.
  • What’s more important: information retention or dimensionality reduction?
  • Common practice is to aim for 70-95% cumulative explained variance, but the target depends on the application.
  • The first few components are crucial for capturing the major variance in the data:
Code
pca_result <- prcomp(data_transform, center=TRUE, scale.=TRUE)
pca_summary <- summary(pca_result)
importance <- as.data.frame(pca_summary$importance)
importance <- as.data.frame(t(importance)) # transpose to make cleaner

importance$Eigenvalues <- pca_result$sdev^2
colnames(importance) <- c("Std Dev", "Proportion", "Cumulative Variance", "Eigenvalues")
importance <- importance[, c("Std Dev", "Eigenvalues", "Proportion", "Cumulative Variance")] # rearrange columns
importance
       Std Dev Eigenvalues Proportion Cumulative Variance
PC1  1.9273424   3.7146486    0.33770             0.33770
PC2  1.3267963   1.7603883    0.16004             0.49773
PC3  1.2316793   1.5170338    0.13791             0.63564
PC4  0.9782909   0.9570531    0.08700             0.72265
PC5  0.9144745   0.8362637    0.07602             0.79867
PC6  0.7843754   0.6152447    0.05593             0.85460
PC7  0.7010809   0.4915145    0.04468             0.89929
PC8  0.6479885   0.4198892    0.03817             0.93746
PC9  0.5455382   0.2976120    0.02706             0.96451
PC10 0.5085014   0.2585737    0.02351             0.98802
PC11 0.3630131   0.1317785    0.01198             1.00000

Scree Plot

plot(pca_result, type = "l", col = "#215B9D", lwd = 2)
  • The scree plot shows an elbow at the fourth principal component, indicating a point of diminishing returns.
  • The first four components might be a sufficient summary of the data.

Eigenvectors (loadings)

Code
eigenvectors <- pca_result$rotation
first_four_eigenvectors <- eigenvectors[, 1:4]
first_four_eigenvectors
                         PC1         PC2         PC3          PC4
obesity_age_adj   0.45539911 -0.08923142  0.04616352 -0.007211041
Smoking_Rate      0.37885662  0.04248172 -0.28743323  0.128820773
Diabetes          0.43751888 -0.11214803  0.07923426  0.080961617
Heart_Disease     0.23446146  0.05725530 -0.35130391 -0.119261323
Cancer            0.35550386 -0.02396108 -0.27508389  0.182844585
Mercury_TPY      -0.07090470 -0.66554141 -0.12306134  0.194309805
Lead_TPY         -0.06505621 -0.68349434 -0.02491826  0.092634475
Food_index       -0.32036082  0.06299314 -0.49456146 -0.040732247
Poverty_Percent   0.37456589  0.02855038  0.35864042 -0.028913891
Atrazine_High_KG  0.09967273 -0.22949368 -0.02370473 -0.936369219
SUNLIGHT         -0.11906414 -0.07901314  0.56599160  0.059355967
  • PC1: Represents general health and lifestyle factors; obesity, smoking, diabetes, heart disease, cancer, and poverty all load positively, so counties with higher PC1 scores have poorer overall health profiles.

  • PC2: Reflects environmental exposure, with higher levels of Mercury and Lead decreasing the component score, indicating negative impacts.

  • PC3: Highlights the relationship between sunlight, food quality, and poverty, showing that decreased access to quality food correlates with increased poverty levels.

  • PC4: Suggests a potential link between the agricultural chemical Atrazine and cancer, indicating environmental health risks.

Biplot

  • A biplot visualizes the first 2 PCs, showing the projected data and how each variable contributes to each PC (arrow magnitude and direction).
  • Variables whose arrows point in the same direction (or in exactly opposite directions) are positively (or negatively) correlated, indicating redundant information.
Code
library(ggfortify)   # provides autoplot() for prcomp objects

autoplot(pca_result, 
         data = data_transform,
         colour = 'grey',
         loadings = TRUE,
         loadings.colour = '#215B9D',
         loadings.label = TRUE,
         loadings.label.colour = 'black', 
         loadings.label.size = 3) + 
  theme_minimal() +
  theme(legend.position = "none")
  • Food_index is negatively correlated with obesity_age_adj and Diabetes, which contributes redundant information to the dataset.
  • Positively correlated groups are identified by the small angle between their vectors: Mercury_TPY and Lead_TPY; Heart_Disease, Smoking_Rate, and Poverty_Percent; and so on.

Application 2: Image Classification

  • PCA is frequently touted as an effective technique to improve image classification tasks through dimensionality reduction (Li et al. 2012).
  • In this application, the effectiveness of PCA was evaluated for a Support Vector Machine (SVM) algorithm and compared to the results from a modern Convolutional Neural Network (CNN).
  • The results from this analysis were used as a feasibility assessment for real-world applications in an IoT environment.

Data Loading and Normalization:

  • The CIFAR-10 dataset contains 60,000 32x32 color images across 10 labeled classes (50,000 for training and 10,000 for testing).
  • This data was loaded and normalized to have pixel values between 0 and 1 in preparation for training SVM and CNN machine learning models.
  • Training, validation, and test sets were created.
Code
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Loading the training and test sets
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Normalize the pixel values to range 0-1
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255

# Split the training set to create a validation set
X_train, X_validate, y_train, y_validate = train_test_split(
    X_train, y_train, test_size=0.15, random_state=42)

Principal Component Analysis (PCA)

Code
from sklearn.decomposition import PCA

# Flatten the X data
X_train_flat = X_train.reshape((X_train.shape[0], -1))
X_validate_flat = X_validate.reshape((X_validate.shape[0], -1))
X_test_flat = X_test.reshape((X_test.shape[0], -1))

# Initialize PCA and fit on the training data
pca = PCA(n_components=0.95)
pca.fit(X_train_flat)
Code

# Transform both the training and testing data
X_train_pca = pca.transform(X_train_flat)
X_validate_pca = pca.transform(X_validate_flat)
X_test_pca = pca.transform(X_test_flat)
  • After flattening the data, PCA was applied to retain 95% of the explained variance.
  • This significantly reduced the dimensionality of the dataset from 3072 dimensions to <250.
Code
import matplotlib.pyplot as plt
import numpy as np

n_components = pca.n_components_
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot the explained variance
plt.figure(figsize=(8, 4))
plt.plot(cumulative_variance)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance')
plt.grid(True)

# Annotate the number of components used
plt.annotate(f'components: {n_components}', 
             xy=(n_components, cumulative_variance[n_components-1]),  # This places the annotation at the point where the number of components is reached
             xytext=(n_components, cumulative_variance[n_components-1] - 0.10),  # Adjust text position
             ha='center')

plt.show()

Principal Component Analysis (PCA)

  • The original images were compared to the PCA-reconstructed images, illustrating that PCA retains moderate image quality despite significant compression.
Code
import matplotlib.pyplot as plt

def plot_images(original, reconstructed, n):
    plt.figure(figsize=(10, 4))
    for i in range(n):
        # Plot original images
        ax = plt.subplot(2, n, i + 1)
        plt.imshow(original[i])
        plt.axis('off')
        if i == 0:
            ax.set_title("Original", loc='left')

        # Plot reconstructed images
        ax = plt.subplot(2, n, n + i + 1)
        norm_image = (reconstructed[i] - np.min(reconstructed[i])) / (np.max(reconstructed[i]) - np.min(reconstructed[i]))
        plt.imshow(norm_image)
        plt.axis('off')
        if i == 0:
            ax.set_title("PCA Reconstructed", loc='left')

    plt.show()

# reconstruct the PCA data into 32x32x3 arrays
X_train_reconstructed = pca.inverse_transform(X_train_pca)
X_train_reconstructed = X_train_reconstructed.reshape((X_train.shape[0], 32, 32, 3))
plot_images(X_train, X_train_reconstructed, n=5) # plot first 5 images

Modeling: Support Vector Machine (SVM)

  • Traditionally, SVM has been used for image classification.
  • An SVM model was trained on the original flattened data and compared to an SVM model trained on the PCA-reduced data (a training sketch is shown after the results below).
  • An RBF kernel was used for nonlinear separation.
Code
import pickle
from sklearn.metrics import accuracy_score
import time
import pandas as pd

def load_pickle(path_pkl):
    with open(path_pkl, 'rb') as file:
        pickle_file = pickle.load(file)
    return pickle_file

def evaluate_prediction_time(model, X_test, n=100):
    X_test = X_test[:n]
    start_time = time.time()
    model.predict(X_test)
    total_time = time.time() - start_time
    return round(total_time, 2)

# load models
model_svm_path = '../model/svm.pkl'
model_svm_pca_path = '../model/svm_PCA.pkl'
model_svm = load_pickle(model_svm_path)
model_svm_pca = load_pickle(model_svm_pca_path)

# load predictions
prediction_path_svm = 'ml_result/test/prediction_svm.pkl'
prediction_path_svm_pca = 'ml_result/test/prediction_svm_pca.pkl'
preds_svm = load_pickle(prediction_path_svm)
preds_svm_pca = load_pickle(prediction_path_svm_pca)

# calculate accuracy
accuracy_svm = round(accuracy_score(y_test, preds_svm), 3)
accuracy_svm_pca = round(accuracy_score(y_test, preds_svm_pca), 3)

# evaluate prediction time
pred_time_svm = evaluate_prediction_time(model_svm, X_test_flat)
pred_time_svm_pca = evaluate_prediction_time(model_svm_pca, X_test_pca)


# Display results for better visualization
results = pd.DataFrame({
    'Model': ['SVM', 'SVM with PCA'],
    'Accuracy': [accuracy_svm, accuracy_svm_pca],
    'Prediction Time (s), n=100': [pred_time_svm, pred_time_svm_pca]
})
print(results)
          Model  Accuracy  Prediction Time (s), n=100
0           SVM     0.536                        10.7
1  SVM with PCA     0.533                         1.1
  • The results show that the PCA model achieved similar accuracy but significantly reduced the prediction time.
  • This demonstrates the effectiveness of dimensionality reduction in speeding up predictions without compromising accuracy.
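The pickled models above were trained offline; for reference, here is a minimal sketch of how such models could be trained (an assumed setup with sklearn.svm.SVC, not necessarily the original hyperparameters).

Code
from sklearn.svm import SVC

# SVM on the flattened pixels (3072 features per image);
# training on the full CIFAR-10 training split is slow
model_svm = SVC(kernel='rbf')                   # assumed default C and gamma
model_svm.fit(X_train_flat, y_train.ravel())

# SVM on the PCA-reduced features (fewer than 250 components at 95% variance)
model_svm_pca = SVC(kernel='rbf')
model_svm_pca.fit(X_train_pca, y_train.ravel())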

Modeling: Convolutional Neural Network (CNN)

  • It is important to recognize situations where PCA may not be the optimal choice - CNNs have become the gold standard for image classification tasks.
  • PCA is typically not applied before a CNN because flattening the data destroys the spatial structure that convolutional layers rely on (Goel, Goel, and Kumar 2023).
  • Architecture: 9-layer CNN with Conv2D layers and a ‘softmax’ output for multi-class probabilities (an assumed training setup is sketched below).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model_cnn = Sequential([
    Input(shape=(32, 32, 3)),
    Conv2D(32, 3, padding='valid', activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Conv2D(64, 3, activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Conv2D(128, 3, activation='relu'),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.50),
    Dense(10, activation='softmax'),
])
  • Model architecture addresses overfitting by using Dropout and MaxPooling2D.
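The CNN itself was trained offline and loaded in the next section; a plausible compile-and-fit setup (assumed, since the training call is not shown in this report) would be:

Code
# Assumed training setup for the CNN above; optimizer, epochs, and batch size
# are illustrative, not necessarily the values used to produce the saved model.
model_cnn.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',   # integer class labels
                  metrics=['accuracy'])

history = model_cnn.fit(X_train, y_train,
                        epochs=30, batch_size=64,
                        validation_data=(X_validate, y_validate))

model_cnn.save('model/cnn_tf213.keras')   # path loaded later during evaluation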

Model Evaluation

  • Evaluations on test data
Code
from tensorflow.keras.models import load_model
import os

# load the model
cnn_model_path = 'model/cnn_tf213.keras'
model_cnn = load_model(cnn_model_path)

# evaluate accuracy
test_loss_cnn, test_accuracy_cnn = model_cnn.evaluate(X_test, y_test, verbose=0)
test_accuracy_cnn = round(test_accuracy_cnn, 3)

# evaluate prediction time
pred_time_cnn = evaluate_prediction_time(model_cnn, X_test)

Code
# display performance metrics for all 3 models
new_row = pd.DataFrame({
    'Model': ['CNN'],
    'Accuracy': [test_accuracy_cnn],
    'Prediction Time (s), n=100': [pred_time_cnn]
})
results = pd.concat([results, new_row], ignore_index=True)

def get_file_size(file_path):
    size_bytes = os.path.getsize(file_path)
    size_mb = size_bytes / (1024 * 1024) # convert to megabytes
    return round(size_mb, 1)

# append new column for model size
size_svm = get_file_size(model_svm_path)
size_svm_pca = get_file_size(model_svm_pca_path)
size_cnn = get_file_size(cnn_model_path)
results['Model Size (MB)'] = [size_svm, size_svm_pca, size_cnn]
print(results)
          Model  Accuracy  Prediction Time (s), n=100  Model Size (MB)
0           SVM     0.536                       10.70            894.7
1  SVM with PCA     0.533                        1.10             65.2
2           CNN     0.705                        0.19              2.6
  • The CNN’s prediction time was more than 5x faster than the SVM with PCA model, with a model size of only 2.6 MB and accuracy over 70%.
  • These qualities make CNNs a great choice for real-time image classification, deployed directly on IoT devices without prior dimensionality reduction.

Conclusion

  • PCA Simplifies Data: Principal Component Analysis (PCA) reduces the dimensionality of complex datasets, transforming original variables into uncorrelated principal components.

  • Key Benefits: Captures key patterns in data, addresses multicollinearity and overfitting, and enhances computational efficiency and model performance.

  • Challenges: Sensitive to outliers and may lose important information in nonlinear relationships.

  • Application in U.S. Counties: PCA applied to demographic, health, and environmental data in the deep south of the U.S. revealed four principal components explaining 72% of the variance, covering health, socioeconomic, environmental pollution, and chemical factors.

  • Image Data Analysis: While traditionally used in image compression and classification, PCA has been largely supplanted by Convolutional Neural Networks (CNNs), which offer superior accuracy and efficiency.

  • CNN vs. PCA: In a CIFAR-10 dataset analysis, CNNs achieved higher accuracy (70.5%) and faster prediction times compared to PCA with SVM (53% accuracy), making CNNs more suitable for real-time processing needs.

  • Continued Relevance: Despite the rise of advanced techniques like CNNs, PCA remains valuable in machine learning pipelines for tabular data, highlighting the importance of choosing the appropriate method based on the specific application.

Ali, Ibrahim, Khaled Wassif, and Hanaa Bayomi. 2024. “Dimensionality Reduction for Images of IoT Using Machine Learning.” Scientific Reports 14: 7205. https://doi.org/10.1038/s41598-024-57385-4.
Altman, Naomi, and Martin Krzywinski. 2018. “The Curse(s) of Dimensionality.” Nature Methods 15 (6): 397–400. https://doi.org/10.1038/s41592-018-0019-x.
Amin, R. W., E. M. Yacko, and R. P. Guttmann. 2018. “Geographic Clusters of Alzheimer’s Disease Mortality Rates in the USA: 2008-2012.” Journal of Prevention of Alzheimer’s Disease (JPAD) 3.
Bharadiya, Jasmin Praful. 2023. “A Tutorial on Principal Component Analysis for Dimensionality Reduction in Machine Learning.” International Journal of Innovative Research in Science Engineering and Technology 8 (5): 2028–32. https://doi.org/10.5281/zenodo.8002436.
“Compression of Spectral Data Using Box-Cox Transformation.” 2014. Color Research & Application 39 (2). https://doi.org/10.1002/col.21771.
Goel, Akash, Amit Kumar Goel, and Adesh Kumar. 2023. “The Role of Artificial Neural Network and Machine Learning in Utilizing Spatial Information.” Spatial Information Research 31: 275–85. https://doi.org/10.1007/s41324-022-00494-x.
Harvey, A., and P. Collier. 1977. “Testing for Functional Misspecification in Regression Analysis.” Journal of Econometrics 6: 103–19.
Johnson, Richard, and Dean Wichern. 2023. Applied Multivariate Statistical Analysis. Pearson.
Joshi, Ketaki, and Bhushan Patil. 2020. “Prediction of Surface Roughness by Machine Vision Using Principal Components Based Regression Analysis.” Procedia Computer Science 167: 382–91. https://doi.org/10.1016/j.procs.2020.03.242.
Li, Jun, Saurabh Prasad, James E Fowler, and Lori M Bruce. 2012. “PCA-Based Feature Reduction for Hyperspectral Remote Sensing Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 50 (1): 370–83.
Maureen, Nwakuya Tobechukwu, Biu Emmanuel Oyinebifun, and Ekwe Christopher. 2022. “Investigating Instability of Regression Parameters and Structural Breaks in Nigerian Economic Data from 1984 to 2019.” International Journal of Mathematics Trends and Technology 68 (12): 67–73. https://doi.org/10.14445/22315373/IJMTT-V68I12P509.
Rahayu, S., T. Sugiarto, L. Madu, Holiawati, and A. Subagyo. 2017. “Application of Principal Component Analysis (PCA) to Reduce Multicollinearity Exchange Rate Currency of Some Countries in Asia Period 2004-2014.” International Journal of Educational Methodology 3 (2): 75–83. https://doi.org/10.12973/ijem.3.2.75.
Tejada-Vera, Betzaida. 2013. “Mortality from Alzheimer’s Disease in the United States: Data for 2000 and 2010.” NCHS Data Brief, no. 116: 1–8.